
Conversation

@Nasf-Fan (Contributor) commented Jan 4, 2026

Currently, the initial timeout for the CRT_OPC_PROTO_QUERY RPC is only 3 seconds, which helps the query move on quickly when some rank(s) are down. But it also increases the risk of the query failing with a timeout when there are only a few targets in the system and they are busy or not yet ready when queried.

This patch adds one more CRT_OPC_PROTO_QUERY RPC retry against the first rank that reported an RPC timeout. The retry uses the default RPC timeout configuration instead of the small initial value.
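
For illustration only, a minimal sketch of the retry idea described above; this is not the actual DAOS code, and query_rank(), the constants, and the stubbed behavior are hypothetical stand-ins:

```c
#include <stdint.h>
#include <stdio.h>

#define INITIAL_TIMEOUT_SEC  3   /* small first-pass timeout             */
#define DEFAULT_TIMEOUT_SEC 60   /* stand-in for the default RPC timeout */

/* Hypothetical stub: pretend every rank times out on the short timeout but
 * answers once it is queried with the default (longer) timeout. */
static int query_rank(uint32_t rank, unsigned int timeout_sec)
{
	(void)rank;
	return timeout_sec >= DEFAULT_TIMEOUT_SEC ? 0 : -1;
}

static int proto_query_with_retry(const uint32_t *ranks, int nr)
{
	int first_timeout_rank = -1;

	/* First pass: short timeout so a truly dead rank is skipped quickly.
	 * In this stub every failure counts as a timeout. */
	for (int i = 0; i < nr; i++) {
		if (query_rank(ranks[i], INITIAL_TIMEOUT_SEC) == 0)
			return 0;
		if (first_timeout_rank < 0)
			first_timeout_rank = i;
	}

	/* One extra attempt against the first rank that timed out, now with
	 * the default timeout, in case it was only busy or not yet ready. */
	if (first_timeout_rank >= 0)
		return query_rank(ranks[first_timeout_rank], DEFAULT_TIMEOUT_SEC);

	return -1;
}

int main(void)
{
	uint32_t ranks[] = {0, 1, 2, 3};

	printf("proto query %s\n",
	       proto_query_with_retry(ranks, 4) == 0 ? "succeeded" : "failed");
	return 0;
}
```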

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions bot commented Jan 4, 2026

Ticket title is 'daos_rpc_proto_query() crt_proto_query()failed: DER_TIMEDOUT(-1011): 'Time out''
Status is 'In Review'
Labels: 'scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-18388

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/2/execution/node/451/log

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/2/execution/node/466/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18388 branch from b408c13 to 0863dd6 Compare January 5, 2026 06:39
@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/5/execution/node/1176/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18388 branch 2 times, most recently from bd43c12 to bc0c86f Compare January 6, 2026 05:25
@Nasf-Fan Nasf-Fan marked this pull request as ready for review January 7, 2026 03:30
@Nasf-Fan Nasf-Fan requested review from a team as code owners January 7, 2026 03:30
knard38 previously approved these changes Jan 7, 2026

@knard38 (Contributor) left a comment

LGTM


/* More retry to the first timeout rank with default timeout. */
rank = rproto->first_timeout_rank;
rproto->timeout = 0;
Contributor

Shouldn't this be rproto->timeout = timeout; in this case? The timeout queried at line 137 will be the 'default timeout' that you mention on line 151.

Contributor Author

We use rproto->timeout as a flag to indicate that we have already retried (see line 142). The related cart-level logic will automatically set the new RPC timeout to the default timeout.

Contributor

I'm not sure why you need to treat 0 as a special value here.

If you set rproto->timeout = timeout, the logic on line 142 will still trigger on the next iteration via the (timeout > 0 && timeout <= rproto->timeout) part.

My concern is that setting it to 0 can lead to issues if someone later decides to add, for example, '+3': instead of 'default timeout' + 3 you would end up with a timeout of only 3 seconds.

Contributor Author

OK, I see your concern. I will refresh the patch.
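
For reference, a minimal sketch of the alternative being suggested here, assuming a hypothetical state struct (the real rproto layout may differ): keep the actual timeout value in the field and record the 'already retried' state in its own flag, so later arithmetic such as default + 3 cannot silently collapse to 3 seconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state kept across proto-query retries. */
struct proto_query_state {
	uint32_t rank;     /* rank currently being queried        */
	uint32_t timeout;  /* timeout (seconds) used for this RPC */
	bool     retried;  /* true once the extra retry was sent  */
};

/* Instead of overloading timeout == 0 to mean "we already retried", keep the
 * real default timeout in the field and mark the retry with a flag. */
void prepare_retry(struct proto_query_state *st, uint32_t first_timeout_rank,
		   uint32_t default_timeout)
{
	st->rank    = first_timeout_rank;
	st->timeout = default_timeout; /* stays meaningful for later "+ 3" math */
	st->retried = true;
}
```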

@Nasf-Fan Nasf-Fan requested a review from frostedcmos January 9, 2026 03:04
@Nasf-Fan (Contributor Author) commented Jan 9, 2026

Ping reviewers, thanks!

@Nasf-Fan Nasf-Fan requested a review from jolivier23 January 9, 2026 05:33
@frostedcmos frostedcmos requested a review from mchaarawi January 9, 2026 17:58
Currently, the initial timeout for the CRT_OPC_PROTO_QUERY RPC is only
3 seconds, which helps the query move on quickly when some rank(s) are
down. But it also increases the risk of the query failing with a
timeout when there are only a few targets in the system and they are
busy or not yet ready when queried.

This patch adds one more CRT_OPC_PROTO_QUERY RPC retry against the
first rank that reported an RPC timeout. The retry uses the default
RPC timeout configuration instead of the small initial value.

Signed-off-by: Fan Yong <[email protected]>

@johannlombardi (Contributor)

let's ask @mchaarawi to have a look at this patch. thanks.

@mchaarawi (Contributor)

> let's ask @mchaarawi to have a look at this patch. thanks.

So the original implementation was endlessly looping and each time increasing the timeout. Then @frostedcmos updated the implementation to just query and fail after all ranks were tried: #17049
The issue is that if we endlessly loop, we cause hangs on the client side and 100% CPU consumption, which was not acceptable for some customer use cases.

The problem with this PR is that it takes the first rank and just retries with the default timeout for it. So IF that rank is truly down and not just busy, we are not resolving the actual problem reported by the ticket.
Is the better solution to come up with a better timeout? IMO 3 seconds for a simple RPC should be enough, and I don't understand how an RPC latency could be that high. It means there are real problems on the system if every engine's RPC latency is higher than 3 seconds, or a general network issue.
But maybe we can increase that 3 seconds to something more reasonable, maybe 10 seconds? The default timeout is just so high and causes unnecessary delays in case of downed engines.

@Nasf-Fan (Contributor Author) commented Jan 12, 2026

> So the original implementation was endlessly looping and each time increasing the timeout. Then @frostedcmos updated the implementation to just query and fail after all ranks were tried: #17049 […]

In one of the observed CI failure instances there was only one target in the system; the test logic called dmg system start, which succeeded, then immediately triggered daos_rpc_proto_query(), which timed out after 3 seconds. On the server side everything looked fine: no failure, no workload.

I am not sure whether that is because the control plane returned success too early, before the server-side network was ready.

In another failure instance there were four targets in the system, all of them alive. But during the four queries, all four targets reported something like:
pool_iv_ent_fetch() d6c86401 iv class id 3, master 3 is still stepping up.: DER_NOTLEADER(-2008): 'Not service leader'

Such log messages repeated on all four targets for about 30 seconds, and during that window all the query RPCs on the client side timed out.

As for the solution, this patch makes one additional query RPC to the first timed-out target with the default timeout. Based on the observed instances, that is enough to resolve the CI failure. But as @mchaarawi pointed out, in a real environment, if the first bad target is truly dead, whether this patch helps depends on how the query RPC fails: -DER_TIMEDOUT or something else.

Anyway, it is fine with me to increase the initial timeout from 3 seconds to some larger value, but I do not know how large a good candidate would be.

@mchaarawi (Contributor) commented Jan 12, 2026

> In one of the observed CI failure instances there was only one target in the system; the test logic called dmg system start, which succeeded, then immediately triggered daos_rpc_proto_query(), which timed out after 3 seconds. […]

Well, in this case the issue is in CI / system tests, and it looks like we have to either wait for the server to fully start before issuing the RPC (pool create or cont create or whatever), or have the test / user retry the operation in case we get a timeout.
It seems like a hack to just fix this for the CI use case, TBH, and adding more complexity to the proto query is unnecessary IMO.

Alternatively, we can just add a testing env variable that increases the proto query timeout to something like 30 seconds. So by default we use 3 seconds, but for those failing test cases in CI we set that env variable to some larger value.
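
A minimal sketch of that env-variable idea, using plain getenv()/strtoul(); the variable name DAOS_PROTO_QUERY_TIMEOUT is hypothetical, not an existing setting:

```c
#include <stdlib.h>

/* Return the proto-query timeout in seconds: 3 by default, or the value of
 * the (hypothetical) DAOS_PROTO_QUERY_TIMEOUT environment variable when it
 * is set to a sane positive integer, e.g. for CI runs that need more slack. */
unsigned int proto_query_timeout(void)
{
	const char   *val = getenv("DAOS_PROTO_QUERY_TIMEOUT");
	unsigned long sec;

	if (val == NULL)
		return 3;

	sec = strtoul(val, NULL, 10);
	return (sec > 0 && sec <= 3600) ? (unsigned int)sec : 3;
}
```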

@Nasf-Fan (Contributor Author)

> Alternatively, we can just add a testing env variable that increases the proto query timeout to something like 30 seconds. […]

This 3-second timeout is not configurable via the environment; instead, it is hard-coded in the logic:

```c
int
daos_rpc_proto_query(crt_opcode_t base_opc, uint32_t *ver_array, int count, int *ret_ver)
{
	...
	rproto->timeout = 3;

	rc = crt_proto_query_with_ctx(&rproto->ep, base_opc, ver_array, count, rproto->timeout,
				      query_cb, rproto, ctx);
	if (rc) {
	...
```

@Nasf-Fan (Contributor Author)

> Well, in this case the issue is in CI / system tests, and it looks like we have to either wait for the server to fully start before issuing the RPC (pool create or cont create or whatever), or have the test / user retry the operation in case we get a timeout. […]

I am not sure whether it is easy for the control plane (or the test logic) to detect such a ready status exactly. If it is not easy, or needs more time to wait, then could the current hack be used as a kind of workaround?

@mchaarawi (Contributor)

> I am not sure whether it is easy for the control plane (or the test logic) to detect such a ready status exactly. If it is not easy, or needs more time to wait, then could the current hack be used as a kind of workaround?

To me it's not a great idea to add core code and make it more complicated just to get around test issues.

> This 3-second timeout is not configurable via the environment; instead, it is hard-coded in the logic:

Right, so I meant we make it configurable or actually retry the command. Do you mind if I assign this issue to myself?

@mchaarawi (Contributor)

@Nasf-Fan For a quick fix, I would suggest just bumping the hard-coded 3-second timeout to 30 (but keep the increment by 3 on the next rank if it times out). This should be more than enough on a smaller system of 1 or 2 ranks to get a response. I will work on a more generic approach later.
Originally we used 3 seconds because the agent was returning dead ranks for proto query, but @knard38 fixed that, so we can use a bigger timeout without big consequences.

Does that sound reasonable to you, @Nasf-Fan?

@frostedcmos (Contributor)

> @Nasf-Fan For a quick fix, I would suggest just bumping the hard-coded 3-second timeout to 30 (but keep the increment by 3 on the next rank if it times out). […]

Something to keep in mind with this:

If the first RPC is sent prematurely, just before an engine is fully up, then bumping the timeout to 30 seconds can delay everything by 30 seconds, as the RPC would need to be resent (on some providers). As such, it's preferable to start with smaller timeout values and gradually bump them up.

On the other hand, on some other providers an attempt to reach an engine that is not up can result in an immediate failure, without the full timeout expiring. In such cases a client that starts prematurely can end up cycling through all engines in no time.

A different approach might be to keep something like a 'total_wait_time' in the proto query and exit the retry loop when the wait time exceeds some threshold (perhaps env-configurable), while keeping the individual proto query RPC timeouts relatively short. The wait time would then need to be calculated as actual time spent on the client, instead of relying on a timeout value.

My thinking here is that if we can't get the proto query to succeed with a 5-10 second timeout after repeated retries to different engines, then we already have some serious issues and shouldn't continue, as it would fail anyway and make further debugging more complicated.
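
A rough, self-contained sketch of that 'total_wait_time' idea, not the DAOS implementation; try_query_next_rank(), the constants, and the always-timing-out stub are hypothetical:

```c
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define PER_QUERY_TIMEOUT_SEC  5   /* short per-RPC timeout           */
#define TOTAL_WAIT_BUDGET_SEC 30   /* overall client-side wait budget */

/* Hypothetical stub: simulate a query attempt that always times out after its
 * nominal timeout. A real implementation would send the proto-query RPC to the
 * next candidate rank and could also fail immediately on some providers. */
static int try_query_next_rank(unsigned int timeout_sec)
{
	sleep(timeout_sec);
	return -1;
}

static int proto_query_with_budget(void)
{
	struct timespec start, now;
	double          elapsed;

	clock_gettime(CLOCK_MONOTONIC, &start);

	for (;;) {
		if (try_query_next_rank(PER_QUERY_TIMEOUT_SEC) == 0)
			return 0;

		/* The budget is real elapsed time on the client, not the sum
		 * of nominal timeouts: a rank that fails instantly burns
		 * almost none of it. */
		clock_gettime(CLOCK_MONOTONIC, &now);
		elapsed = (now.tv_sec - start.tv_sec) +
			  (now.tv_nsec - start.tv_nsec) / 1e9;
		if (elapsed >= TOTAL_WAIT_BUDGET_SEC)
			return -1; /* give up; the system has bigger problems */
	}
}

int main(void)
{
	printf("proto query %s\n",
	       proto_query_with_budget() == 0 ? "succeeded" : "gave up");
	return 0;
}
```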

@Nasf-Fan (Contributor Author)

> Right, so I meant we make it configurable or actually retry the command. Do you mind if I assign this issue to myself?

I did not know who was the right person to handle such a proto query issue, so I made the workaround patch myself.
It will be good if you can work out a more general solution. But considering @frostedcmos's comment, it does not seem like a good idea to bump the timeout to 30 seconds for the initial proto query RPC.

Anyway, I will assign DAOS-18388 to you, @mchaarawi.

@mchaarawi (Contributor)

> If the first RPC is sent prematurely, just before an engine is fully up, then bumping the timeout to 30 seconds can delay everything by 30 seconds, as the RPC would need to be resent (on some providers). As such, it's preferable to start with smaller timeout values and gradually bump them up.

Yes, that is fine really. In production this will not happen. In CI, it means we will have a 30-second timeout, but by that time the engines should be up (on those 1 or 2 engine systems that seem to be the ones failing this test).

> A different approach might be to keep something like a 'total_wait_time' in the proto query and exit the retry loop when the wait time exceeds some threshold (perhaps env-configurable), while keeping the individual proto query RPC timeouts relatively short. […]

But what is this total wait time supposed to be? Originally we thought just retrying forever was fine. I think this is again an unnecessary complication.

> My thinking here is that if we can't get the proto query to succeed with a 5-10 second timeout after repeated retries to different engines, then we already have some serious issues and shouldn't continue, as it would fail anyway and make further debugging more complicated.

You are correct, if we have a normal system with several engines. The problem here is that there are some tests with just 1 or 2 engines. I think just bumping the initial timeout to 30 seconds should be OK; maybe 30 is overkill and 10 is more appropriate. Originally it was 60, but the complaint was that we were always querying dead engines in many cases. But as I said, that issue was fixed.
The problem now is that after your change, there are some test cases that are failing because there is only 1 engine in the system, and after the first timeout the query exits and the test never retries. Ideally we can fix the test to retry the operation in those cases, but as a quick fix, just bumping the timeout a bit is easier.

@Nasf-Fan (Contributor Author)

> I think just bumping the initial timeout to 30 seconds should be OK; maybe 30 is overkill and 10 is more appropriate. […] Ideally we can fix the test to retry the operation in those cases, but as a quick fix, just bumping the timeout a bit is easier.

If bumping the timeout is acceptable, then my current patch may be better, since it does not increase the wait time in most normal cases where there is no real RPC timeout (equal to the behavior without my patch); but if all engines are not ready in time (such as in the CI test failure cases), then my patch introduces at most 60 seconds of additional wait.

@mchaarawi (Contributor)

> If bumping the timeout is acceptable, then my current patch may be better, since it does not increase the wait time in most normal cases where there is no real RPC timeout […]

The reason I'm pushing back on your patch is that it makes this proto query unnecessarily more complicated for a non-production use case, just for CI.
Bumping the timeout will work for the CI use case, IMO.
For the production use case, it is not going to affect things much because we do not return dead ranks anymore if the agent is refreshed. If the agent is not refreshed and it returns a dead rank, it introduces a higher timeout, but that just tells the user to refresh their agent on the node they are using.

@Nasf-Fan (Contributor Author)

> The reason I'm pushing back on your patch is that it makes this proto query unnecessarily more complicated for a non-production use case, just for CI. Bumping the timeout will work for the CI use case, IMO. […]

Bumping the initial timeout will not only affect CI; it may also make a real production system wait longer?

@mchaarawi (Contributor) commented Jan 14, 2026

> Bumping the initial timeout will not only affect CI; it may also make a real production system wait longer?

I just explained that in the previous comment: "For the production use case, it is not going to affect things much because we do not return dead ranks anymore if the agent is refreshed. If the agent is not refreshed and it returns a dead rank, it introduces a higher timeout, but that just tells the user to refresh their agent on the node they are using."

And we can bump the timeout to maybe something like 10/20 seconds, not the default of 60 seconds.

@Nasf-Fan (Contributor Author)

> And we can bump the timeout to maybe something like 10/20 seconds, not the default of 60 seconds.

OK, I will close this one and assign the ticket to you for a more general solution.

@Nasf-Fan Nasf-Fan closed this Jan 15, 2026